-
-
Notifications
You must be signed in to change notification settings - Fork 4k
Use bit ops instead of integer modulo and divide in shaders #19994
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a rendering expert, but I would appreciate comments above the bitops that show what the equivalent non-bitops are.
let material_id = vertex_input / 3u; | ||
let vertex_index = vertex_input - material_id * 3u; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be worse. I thought compilers can see consecutive % and / and combine the instructions into one thing(?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not when its split into a function call that looks like
int naga_mod(int lhs, int rhs) {
int divisor = ((lhs == int(-2147483647 - 1) & rhs == -1) | (rhs == 0)) ? 1 : rhs;
return lhs - (lhs / divisor) * divisor;
}
// ...
let vertex_index = naga_mod(vertex_input, 3u);
let material_id = vertex_input / 3u;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be worse. I thought compilers can see consecutive % and / and combine the instructions into one thing(?)
If you play around on Godbolt you'll notice that it actually emits this QUOTIENT <- DIVIDEND / DIVISOR; REMAINDER <- DIVIDEND - QUOTIENT * DIVISOR
idiom quite often on higher optimizations, and less so on lower optimizations.
Not all division hardware (Goldschmidt comes to mind) computes both results at once (at least not as a ready-to-use integer remainder), and in general I trust shader compilers to do these optimizations on code way less than I do for regular compilers.
let sub_xy = remap_for_wave_reduction(local_invocation_index % 64u); | ||
let x = sub_xy.x + 8u * ((local_invocation_index >> 6u) % 2u); | ||
let sub_xy = remap_for_wave_reduction(local_invocation_index & 63u); | ||
let x = sub_xy.x + 8u * ((local_invocation_index >> 6u) & 1u); | ||
let y = sub_xy.y + 8u * (local_invocation_index >> 7u); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- 8u is << 3u. Or does naga deal with that properly?
let sub_xy = remap_for_wave_reduction(local_invocation_index % 64u); | ||
let x = sub_xy.x + 8u * ((local_invocation_index >> 6u) % 2u); | ||
let sub_xy = remap_for_wave_reduction(local_invocation_index & 63u); | ||
let x = sub_xy.x + 8u * ((local_invocation_index >> 6u) & 1u); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- 8u is << 3u
@@ -103,6 +103,13 @@ fn henyey_greenstein(neg_LdotV: f32) -> f32 { | |||
return FRAC_4_PI * (1.0 - g * g) / (denom * sqrt(denom)); | |||
} | |||
|
|||
fn simple_wrap_3(index: i32) -> i32 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can index be >=6? If so then this is wrong, if not then I think this function should be named/commented to indicate it is special purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its only used in this file, and its only called with numbers in range 0-5. I called it simple_wrap_3, not implying its modulo, because it is not an implementation of modulo, just something that works for this specific case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a comment saying that it only works for values 0 to 5, which is fine for its use in this file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My only nit is that this function signature should be unsigned as it's indexing into an array.
Comment could be something simple like // Helper for plane indices, only valid for 0-5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looks fine to me.
I noted early in my review that multiplication by powers of two is not changed to left shifts. Does naga handle this or is there no value to it?
Aside from that, just one comment about the simple_wrap_3 function to address.
Integer multiplication has a dedicated instruction on gpus, there's no need to replace it, bit shifts win us nothing in that case. Integer division and modulos are the real expensive ones, they are on the order of 100 cycles, whereas integer multiplies are usually a cycle or two. |
I personally don't agree. I would agree if there were large bitop expressions involved that do more complex things but this PR only uses single expressions like Personally I think commenting these single expressions would just add noise. (But I may also be biased as I'm in firmware and bitops are everywhere.)
GPU compilers can do fused multiply-add if there is an addition in the expression, which in this PR there is. Left-shifting may actually be less performant here. Aside from the helper function everything LGTM. |
For reference, I had no idea reading these simple expressions what their equivalent non bit-ops operations were :) Definitely a skill issue (I have no use for bit operations in my ordinary work, and have a non-conventional background), but we already struggle getting new contributors for rendering code; I'm reluctant to make it more arcane. |
Im not bothering to document this until i can actually prove its worthwhile on at least one platform, would be wasted work otherwise. It didnt help CI runners |
Objective
in place of
%
, even for constant arguments. Some driver toolchains such as FXC then go on to complain about and seemingly not realize they can optimize this.This potentially actually ends up mattering for CI times, but we'll see.
Solution
Alternate solution considered
Testing